Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

Authors

  • Wing-Kai Hon
  • Rahul Shah
  • Sharma V. Thankachan
Abstract

Let D = {d_1, d_2, ..., d_D} be a given set of D string documents of total length n. Our task is to index D such that the k most relevant documents for an online query pattern P of length p can be retrieved efficiently. We propose an index of size |CSA| + n log D (2 + o(1)) bits with O(t_s(p) + k log log n + poly log log n) query time for the basic relevance metric, term frequency, where |CSA| is the size (in bits) of a compressed full-text index of D that searches for a pattern of length p in O(t_s(p)) time. We further reduce the space to |CSA| + n log D (1 + o(1)) bits; the query time then becomes O(t_s(p) + k (log σ log log n)^(1+ε) + poly log log n), where σ is the alphabet size and ε > 0 is any constant.
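To make the query semantics concrete, the following is a minimal brute-force sketch of top-k retrieval under the term-frequency metric. It is not the index proposed in the abstract: it scans every document at query time and uses no compressed suffix array. The function names `count_occurrences` and `top_k_by_term_frequency`, the choice to count overlapping occurrences, and the tie-breaking rule (smaller document id first) are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(documents, pattern, k):
    """Return up to k (doc_id, frequency) pairs, highest frequency first.

    Brute force: every document is scanned for every query. Documents
    that do not contain the pattern are never reported. Ties are broken
    by smaller document id (an assumption for this sketch).
    """
    scored = []
    for doc_id, doc in enumerate(documents):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            scored.append((tf, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda item: (-item[0], item[1]))
    return [(doc_id, tf) for tf, doc_id in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cherry"]
    print(top_k_by_term_frequency(docs, "ana", 2))
```

This baseline takes time proportional to the total collection length per query, which is the cost an index with query time depending mainly on p and k is meant to avoid.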

Similar resources

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Top-k document retrieval in optimal space

We present an index for top-k most frequent document retrieval whose space is |CSA| + o(n) + D log(n/D) + O(D) bits and whose query time is O(log k log^(2+ε) n) per reported document, where D is the number of documents, n is the sum of the lengths of the documents, and |CSA| is the space of the compressed suffix array for the documents. This improves over previous results for this problem, whose space complex...

Forbidden Extension Queries

Document retrieval is one of the most fundamental problems in information retrieval. The objective is to retrieve all documents from a document collection that are relevant to an input pattern. Several variations of this problem, such as ranked document retrieval, document listing with two patterns, and forbidden patterns, have been studied. We introduce the problem of document retrieval with forbi...

Improved Single-Term Top-k Document Retrieval

On natural language text collections, finding the k documents most relevant to a query is generally solved with inverted indexes. On general string collections, however, more sophisticated data structures are necessary. Navarro and Nekrich [SODA 2012] showed that a linear-space index can solve such top-k queries in optimal time O(m + k), where m is the query length. Konow and Navarro [DCC 2013]...

Publication date: 2012